🚀 LLM Inference Servers

A Complete Journey from Basics to Advanced Concepts

🎯 The Fundamentals: How LLM Inference Actually Works

đŸŊī¸ The Restaurant Analogy

Think of an LLM like a master chef in a restaurant. When you give them a recipe (prompt), they don't cook the entire meal at once. Instead, they prepare it one ingredient at a time, tasting and adjusting as they go. Each "ingredient" is like a token (word or part of a word) that the model generates one by one.

Step 1: Understanding Tokens

What are Tokens?

"Hello my name is Alice" → [Hello] [my] [name] [is] [Alice]

Tokens are the basic units that LLMs understand - they can be words, parts of words, or even punctuation!
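
If you want to see this for yourself, here is a minimal sketch using the Hugging Face transformers library with the GPT-2 tokenizer (an assumption for illustration; the exact splits vary from model to model):

# Minimal tokenization demo (assumes the `transformers` package is installed)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Hello my name is Alice"
print(tokenizer.tokenize(text))   # e.g. ['Hello', 'Ġmy', 'Ġname', 'Ġis', 'ĠAlice']
print(tokenizer.encode(text))     # the integer IDs the model actually sees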

Step 2: The Two Phases of Inference

🔄 PREFILL Phase

Process the entire prompt in parallel

Fast but compute-intensive

📝 DECODE Phase

Generate tokens one by one

Slower and memory-bandwidth-bound, but predictable

đŸƒâ€â™‚ī¸ The Reading vs Writing Analogy

Prefill is like speed-reading an entire book (your prompt) very quickly to understand the context.

Decode is like writing a response letter word by word, thinking carefully about each word before writing the next one.

Step 3: What Makes This Challenging?

🚨 The Big Problem: Each new token depends on ALL previous tokens!

This creates a memory bottleneck and limits how fast we can generate text.

# Simple example of how inference works (pseudocode; `model` is a stand-in)
prompt = "The weather today is"
max_tokens = 20

# Prefill: process the entire prompt at once
context = model.process_prompt(prompt)

# Decode: generate one token at a time
tokens = []
for i in range(max_tokens):
    next_token = model.predict_next(context + tokens)
    tokens.append(next_token)
    if next_token == "<END>":
        break

# Result: "The weather today is sunny and warm"

🧠 KV Cache: The Memory That Makes Everything Fast

🎓 The Study Group Analogy

Imagine you're in a study group discussing a complex topic. Instead of re-reading the entire textbook every time someone asks a question, you keep detailed notes (KV Cache) of everything discussed so far. When a new question comes up, you can quickly reference your notes instead of starting from scratch!

What is KV Cache?

🔑 Key-Value Cache Explained

Keys (K): Help the model find relevant information

Values (V): Store the actual information content

K1/V1 | K2/V2 | K3/V3 | K4/V4 | K5/V5

Each token gets its own Key-Value pair stored in memory

Why is KV Cache Necessary?

❌ Without KV Cache (Inefficient)

Token 1: Process [The]

Token 2: Process [The, cat] ← Recalculate everything!

Token 3: Process [The, cat, sat] ← Recalculate everything again!


✅ With KV Cache (Efficient)

Token 1: Process [The] → Store K1,V1

Token 2: Use K1,V1 + Process [cat] → Store K2,V2

Token 3: Use K1,V1,K2,V2 + Process [sat] → Store K3,V3
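
Here is a toy NumPy sketch of the "with KV cache" pattern above. The projection matrices and the attend function are stand-ins for a real transformer layer, so treat it as an illustration of the bookkeeping, not real model code:

# Toy KV-cache bookkeeping with NumPy (illustrative only)
import numpy as np

d = 8                              # toy hidden size
W_q = np.random.randn(d, d)        # toy Query projection
W_k = np.random.randn(d, d)        # toy Key projection
W_v = np.random.randn(d, d)        # toy Value projection

def attend(q, keys, values):
    # standard scaled dot-product attention over everything cached so far
    scores = np.array([q @ k for k in keys]) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return sum(w * v for w, v in zip(weights, values))

cache_k, cache_v = [], []
for token_embedding in np.random.randn(3, d):   # stand-ins for [The, cat, sat]
    # each token's K and V are computed exactly once and cached...
    cache_k.append(token_embedding @ W_k)
    cache_v.append(token_embedding @ W_v)
    # ...so the new token only attends over the cache; nothing is recomputed
    out = attend(token_embedding @ W_q, cache_k, cache_v)  # feeds the next layer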

The Memory Challenge

🔥 For a 70B model like Llama 3.3:

• Each token needs ~800KB of KV cache

• A 2048-token conversation = 1.6GB just for cache!

• This grows linearly with context length
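
The arithmetic behind such estimates is simple. The sketch below shows the formula with illustrative, assumed Llama-style settings; the exact per-token figure depends heavily on the layer count, the number of KV heads (grouped-query attention shrinks it well below a full multi-head layout), the head dimension, and the cache precision, which is why published numbers for the "same" model size vary:

# Back-of-envelope KV cache sizing (sketch; plug in your own model's config)
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # 2x because both a Key and a Value vector are stored per layer, per KV head
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Illustrative numbers only: assumed Llama-style 70B with grouped-query
# attention and an FP16 cache
per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
print(per_token / 1024, "KB per token")
print(per_token * 2048 / 1024**3, "GB for a 2048-token context")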

🏠 The Library Analogy

Think of KV cache like a growing library. Each new book (token) you add needs shelf space (memory). As your library grows, you need more and more shelves. Eventually, you run out of space and need clever storage solutions!

KV Cache Optimizations

đŸ—œī¸ Quantization

Compress the cache data from 16-bit to 8-bit or even 4-bit numbers. Less accurate but much smaller!

📄 Paging

Break cache into small "pages" like computer memory. Avoid wasting space on unused memory!

💾 Offloading

Move old cache data to CPU memory or disk when GPU memory gets full.
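
As a toy illustration of the quantization idea (not the exact scheme any particular server uses), here is an 8-bit round-trip with a single per-tensor scale:

# Toy KV-cache quantization: store int8 values plus a scale instead of FP16
import numpy as np

def quantize_int8(x):
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0   # per-tensor scale
    return np.round(x / scale).astype(np.int8), scale

def dequantize_int8(q, scale):
    return (q.astype(np.float32) * scale).astype(np.float16)

kv_slab = np.random.randn(1024, 128).astype(np.float16)   # pretend cache block
q, scale = quantize_int8(kv_slab.astype(np.float32))
print(kv_slab.nbytes, "->", q.nbytes, "bytes (2x smaller)")
print("mean reconstruction error:",
      np.abs(dequantize_int8(q, scale) - kv_slab).mean())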

⚡ Batching: Serving Multiple Users Efficiently

🚌 The Bus Route Analogy

Imagine you're running a bus service. You could send a separate bus for each passenger (inefficient), or you could group passengers going in the same direction and use one big bus (efficient batching)!

Static vs Continuous Batching

❌ Static Batching (Old Way)

User A: 5 tokens | User B: 50 tokens | User C: 5 tokens
→ Everyone waits for User B to finish!

✅ Continuous Batching (New Way)

User A: ✓ done | User B: token 25/50 | User D: new!
→ Users come and go dynamically!
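
In code, continuous batching boils down to a scheduling loop like the highly simplified sketch below; model.decode_step, the Request fields, and the queue handling are hypothetical stand-ins, not any real server's interface:

# Highly simplified continuous-batching loop (sketch with hypothetical APIs)
from collections import deque

def continuous_batching_loop(model, waiting: deque, max_batch_size=8):
    running = []
    while waiting or running:
        # 1. Admit new requests the moment a slot frees up -- no waiting for
        #    the whole batch to drain like in static batching
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # 2. One decode step for every active request, batched together
        new_tokens = model.decode_step(running)

        # 3. Retire finished requests immediately so their slots can be reused
        still_running = []
        for req, tok in zip(running, new_tokens):
            req.tokens.append(tok)
            if tok != "<END>" and len(req.tokens) < req.max_tokens:
                still_running.append(req)   # not done yet, stays in the batch
        running = still_running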

PagedAttention: The Smart Memory Manager

🧩 PagedAttention Explained

Instead of reserving huge chunks of memory for each user, PagedAttention divides memory into small "pages" and assigns them as needed - just like how your computer's operating system manages memory!

❌ Traditional: reserved 2048 tokens, used 100 tokens → 95% wasted!

✅ PagedAttention: allocated 100 tokens, used 100 tokens → 0% wasted!
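
A toy block allocator captures the core bookkeeping; the block size and method names below are illustrative, not vLLM's actual implementation:

# Toy paged KV-cache allocator in the spirit of PagedAttention (sketch)
class PagedKVAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size                 # tokens stored per physical block
        self.free_blocks = list(range(num_blocks))   # pool of physical block IDs
        self.block_tables = {}                       # request -> list of block IDs
        self.token_counts = {}                       # request -> tokens written so far

    def append_token(self, request_id):
        """Account for one new token; grab a fresh block only when the last one is full."""
        count = self.token_counts.get(request_id, 0)
        if count % self.block_size == 0:             # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks")
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1

    def free(self, request_id):
        """Return a finished request's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)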

Performance Impact

🚀 Throughput Improvements

HuggingFace Transformers: 1x (baseline)
Text Generation Inference: 3.5x
vLLM (with PagedAttention): 24x

🔄 Disaggregated Serving: Separating Prefill from Decode

🏭 The Factory Assembly Line Analogy

Imagine a factory where one team is really good at preparing ingredients (prefill) and another team excels at final assembly (decode). Instead of having each worker do both tasks poorly, you separate them into specialized stations for maximum efficiency!

The Problem with Traditional Serving

😤 Interference Issues

Prefill: Needs lots of compute, short burst

Decode: Needs consistent memory access, long duration

Together: They fight for resources and slow each other down!

Disaggregated Architecture

🔄 Prefill Cluster

Specialized for parallel processing

Optimized for TTFT

📡 KV Transfer

High-speed network

~17ms for 2048 tokens

📝 Decode Cluster

Specialized for sequential generation

Optimized for throughput
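
Conceptually, the request path looks like the sketch below; every name here (prefill_worker, decode_worker, receive_kv, ...) is a hypothetical placeholder rather than any specific framework's API:

# Conceptual disaggregated request path (all names are hypothetical placeholders)
def handle_request(prompt, prefill_worker, decode_worker, max_tokens=256):
    # 1. Prefill cluster: one parallel pass over the whole prompt, optimized for TTFT
    kv_cache, first_token = prefill_worker.prefill(prompt)

    # 2. Hand the KV cache to the decode cluster over a fast interconnect
    handle = decode_worker.receive_kv(kv_cache)

    # 3. Decode cluster: stream the remaining tokens one at a time
    tokens = [first_token]
    while len(tokens) < max_tokens and tokens[-1] != "<END>":
        tokens.append(decode_worker.decode_step(handle))
    return tokens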

Benefits of Disaggregation

⚡ Better Latency

Prefill doesn't interfere with decode operations. Users get consistent response times.

🎯 Specialized Hardware

Use different GPU configurations optimized for each phase's specific needs.

📈 Higher Throughput

Up to 7x higher request rates with the same SLA requirements.

💰 Cost Efficiency

Scale prefill and decode independently based on actual demand patterns.

Real-World Example

đŸĸ OpenAI and Google use disaggregated serving!

For ChatGPT: Prefill cluster handles your prompt, then hands off to decode cluster for streaming response generation.

🔮 Speculative Decoding: Predicting the Future

🎯 The Chess Master Analogy

Imagine a chess master playing against a computer. The master can quickly think of several good moves (draft), then the computer carefully verifies which ones are actually legal and best (verification). This way, multiple moves can be planned in the time it usually takes to plan one!

How Speculative Decoding Works

đŸƒâ€â™‚ī¸ Draft Model

Small, fast model

Generates 3-5 candidate tokens

🔍 Verification

Large target model

Checks all candidates in parallel

✅ Accept/Reject

Keep good predictions

Reject bad ones
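
A greatly simplified greedy version of that loop is sketched below. Real systems use a probabilistic accept/reject rule that preserves the target model's output distribution; this toy version just keeps the longest exact-match prefix, and both model APIs are assumed:

# Simplified draft-and-verify step (sketch with hypothetical model APIs)
def speculative_step(target_model, draft_model, context, k=4):
    # 1. Draft: the small model proposes k tokens cheaply, one after another
    draft = []
    for _ in range(k):
        draft.append(draft_model.greedy_next(context + draft))

    # 2. Verify: the big model scores all k+1 positions in ONE forward pass,
    #    returning its own greedy prediction at each position
    target_preds = target_model.greedy_next_batch(context, draft)

    # 3. Accept the longest prefix where the draft matches the target,
    #    then append the target's own next token (which we get for free)
    accepted = []
    for proposed, wanted in zip(draft, target_preds):
        if proposed != wanted:
            break
        accepted.append(proposed)
    accepted.append(target_preds[len(accepted)])
    return accepted          # anywhere from 1 to k+1 new tokens per target pass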

Types of Speculative Decoding

🤖 Separate Draft Model

Use a smaller version of the same model (e.g., 7B drafting for 70B)

Speed: 2-3x faster

🔄 Self-Speculative

Use the same model with some layers skipped for drafting

Speed: 1.5-2x faster

🎯 Medusa Heads

Add multiple prediction heads to the main model

Speed: 2-3x faster

🔍 Prompt Lookup

Reuse tokens that already appeared in the prompt

Speed: Very context-dependent

Real Example

🎯 Speculative Decoding in Action

Current context: "The capital of France is"

Draft Model predicts: ["Paris", "located", "in"]
Target Model verifies:
✅ "Paris" - Correct!
❌ "located" - Rejects, generates "and"
âšī¸ "in" - Not checked (sequence broken)
Result: Accepted "Paris", generated "and"
Progress: 2 tokens in 1 forward pass! 🚀

Performance Benefits

🚀 Google AI Overviews uses speculative decoding for 2-4x speedup

⚡ Perfect for scenarios where GPU is underutilized (small batch sizes)

🎯 Best results when draft model has 60-80% acceptance rate

🦙 Ollama & GGUF: Running Models on Your Laptop

📱 The Mobile App Analogy

Think of GGUF files like mobile apps that are optimized to run on your phone instead of requiring a powerful desktop computer. They're compressed and efficient versions of the full models that can run on consumer hardware!

What is GGUF?

đŸ—œī¸ GGUF (GGML Universal File)

G: Georgi (creator's name)

G: Gerganov (creator's surname)

U: Universal

F: File format


A special file format that stores AI models in a compressed, CPU-friendly way!

Quantization Levels

📊 Q4_K_M

Size: ~40GB (70B model)

Quality: Good balance

Speed: Fast

📊 Q8_0

Size: ~75GB (70B model)

Quality: High quality

Speed: Moderate

📊 F16

Size: ~140GB (70B model)

Quality: Original quality

Speed: Slower

How Ollama Works

📥 Download

Pull GGUF model from Hugging Face or Ollama registry

🔧 Load

Load into system memory (RAM/GPU)

💬 Chat

Start chatting with OpenAI-compatible API

Running a Model Example

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model (automatically downloads GGUF)
ollama pull llama3.3:70b-instruct-q4_K_M

# Chat with the model
ollama run llama3.3:70b-instruct-q4_K_M
>>> Hello! How do you work?
I'm an AI running locally on your machine using a quantized GGUF format that
compresses my 70B parameters down to just 4 bits per parameter...

# Use via API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3:70b-instruct-q4_K_M",
    "messages": [
      {"role": "user", "content": "Explain inference servers"}
    ]
  }'
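
The same API call can be made from Python; this sketch assumes the openai package (v1+) is installed and that Ollama is serving on its default port 11434:

# Calling Ollama's OpenAI-compatible endpoint from Python (sketch)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # any non-empty string; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.3:70b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Explain inference servers"}],
)
print(response.choices[0].message.content)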

GGUF vs Traditional Models

đŸ—ī¸ Traditional PyTorch Model

• Requires datacenter GPUs with 80GB+ VRAM (more than one for 70B in FP16)

• Full 16-bit precision weights

• Complex serving infrastructure

• Size: ~140GB for 70B model

🦙 GGUF Model with Ollama

• Runs on a workstation or high-RAM laptop (~48GB+ RAM for 70B)

• Quantized 4-bit weights

• Simple single-command setup

• Size: ~40GB for 70B model

Performance Considerations

💻 CPU Inference: 1-5 tokens/second on M2 MacBook

đŸ–Ĩī¸ GPU Inference: 10-50 tokens/second on RTX 4090

🧠 Memory Usage: Model size + 2-4GB for context

🏭 Inference Server Comparison: Choosing Your Champion

đŸŽī¸ The Racing Car Analogy

Different inference servers are like different types of racing cars. A Formula 1 car (TensorRT-LLM) is fastest on a professional track, a rally car (vLLM) works great in various conditions, and a family sedan (TGI) is reliable and easy to drive everywhere!

Server Performance Comparison

🏆 Throughput (Tokens/Second) - Llama 3 70B @ 100 Users

LMDeploy: 700 t/s
TensorRT-LLM: 700 t/s
vLLM: 650 t/s
TGI: 650 t/s

⚡ Time to First Token (Lower is Better)

vLLM: Best
TGI: Good
TensorRT-LLM: Variable
LMDeploy: Excellent

Detailed Server Breakdown

🚀 vLLM

Best for: Research, fast TTFT

Strengths: PagedAttention, easy setup

Hardware: NVIDIA, AMD, Intel

Weaknesses: Newer project, fewer enterprise features

⚡ TensorRT-LLM

Best for: Maximum NVIDIA GPU performance

Strengths: Fastest on NVIDIA, FP8 support

Hardware: NVIDIA only

Weaknesses: Complex setup, compilation needed

đŸ›Ąī¸ Triton Inference Server

Best for: Enterprise, multi-framework

Strengths: Mature, supports any model

Hardware: All major platforms

Weaknesses: Complex configuration

🤗 Text Generation Inference

Best for: HuggingFace ecosystem, beginners

Strengths: Easy setup, good docs

Hardware: Broad support

Weaknesses: Not always fastest

Decision Tree

🤔 Do you need maximum speed?

YES → TensorRT-LLM
NO → Continue...

đŸ› ī¸ Do you want easy setup?

YES → vLLM or TGI
NO → Continue...

đŸĸ Do you need enterprise features?

YES → Triton
NO → vLLM

Real-World Usage

🔥 Production at Scale: Most companies use multiple servers!

📊 Example: TensorRT-LLM for high-throughput batch inference + vLLM for interactive chat

🔄 Trend: Moving toward disaggregated architectures with specialized servers for prefill vs decode